In [None]:
# Install required packages
!pip install --upgrade --quiet natural-pdf[ai,ocr-export]

print('✓ Packages installed!')

**Slides:** [slides.pdf](./slides.pdf)

# Let's ask questions

Time for some AI magic. We're using **extractive question answering**, which is different from an LLM because it pulls its answer directly *from the page*. LLMs are *generative AI*: they take your question and generate *new* text.
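
To make the distinction concrete, here is a toy sketch in plain Python. It is not how `.ask` works internally (that relies on a trained model); the `extract_answer` helper, its word-overlap scoring, and the sample text are all made up for illustration. The only point is that an extractive answer is always a span copied from the source text, paired with a score.

```python
import re

def extract_answer(question: str, page_text: str) -> dict:
    """Toy extractive QA: pick the line that shares the most words with the question."""
    question_words = set(re.findall(r"\w+", question.lower()))
    best = {"answer": None, "confidence": 0.0}
    if not question_words:
        return best
    for line in page_text.splitlines():
        line_words = set(re.findall(r"\w+", line.lower()))
        # Score = fraction of question words that also appear in this line
        score = len(question_words & line_words) / len(question_words)
        if score > best["confidence"]:
            best = {"answer": line, "confidence": score}
    return best

# Made-up page text, for illustration only
page_text = "Site: Example Packing Co.\nDate: January 1, 2000\nViolation Count: 3"

result = extract_answer("What is the date?", page_text)
print(result)
```

Whatever comes back is guaranteed to be text that exists on the page, which is exactly what a generative model cannot promise.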

In [None]:
from natural_pdf import PDF

pdf = PDF("https://github.com/jsoma/natural-pdf/raw/refs/heads/main/pdfs/01-practice.pdf")
page = pdf.pages[0]
page.show()

In [None]:
result = page.ask("What date was the inspection?")
result

Notice it has a **confidence score**, which makes life great. You can also use `.show()` to see where it's getting the answer from.

In [None]:
result.show()

It automatically hides answers it doesn't have much faith in. Let's ask for the **Summary**.

That does NOT mean it's always accurate, though. Phrasing your question with words that appear on the page makes it a lot easier. **How should we ask about the number of violations?**

We can also ask for **multiple things at once.**

There are better ways to extract structured data, though.

## Structured data generation

### Using extractive Doc Q&A (same as `.ask`)

In [None]:
from natural_pdf import PDF

pdf = PDF("https://github.com/jsoma/natural-pdf/raw/refs/heads/main/pdfs/01-practice.pdf")
page = pdf.pages[0]

You can use `page.extract` to (attempt to) extract structured data.

In [None]:
page.extract(["site", "date", "violation count", "inspection service", "summary", "city", "state"])

In [None]:
page.extracted('city')

## Leveraging an LLM for structured data

Sometimes you want an opinion from an LLM, though. You want it to write things that aren't in there, or piece together something complicated. It's worth the potential for hallucinations!

Below we're using Google's Gemini through its [OpenAI compatibility](https://ai.google.dev/gemini-api/docs/openai) layer.

In [None]:
import os
from openai import OpenAI

# Initialize your LLM client
# Anything OpenAI-compatible works!
client = OpenAI(
    api_key=os.environ["GOOGLE_API_KEY"],  # Your API key
    base_url="https://generativelanguage.googleapis.com/v1beta/openai/"  # Changes based on what AI you're using
)

fields = ["site", "date", "violation count", "inspection service", "summary", "city", "state"]
page.extract(fields, client=client, model="gemini-2.0-flash-lite") 

In [None]:
dict(page.extracted())
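
Since hallucinations are the trade-off here, one cheap sanity check is to see whether each extracted value appears verbatim in the page text. The sketch below uses plain strings and a hypothetical `find_ungrounded` helper with made-up data; in the notebook, the page text would presumably come from something like `page.extract_text()`. Fields such as a summary are *expected* to fail this check, since the LLM writes them itself.

```python
def find_ungrounded(extracted: dict, page_text: str) -> list:
    """Return the field names whose value does not appear verbatim in the page text."""
    # Normalize case and whitespace so line breaks don't cause false alarms
    haystack = " ".join(page_text.lower().split())
    missing = []
    for field, value in extracted.items():
        needle = " ".join(str(value).lower().split())
        if needle not in haystack:
            missing.append(field)
    return missing

# Made-up page text and extraction result, for illustration only
page_text = "Site: Example Packing Co.\nCity: Springfield\nViolation Count: 3"
extracted = {"site": "Example Packing Co.", "city": "Shelbyville", "violation count": 3}

flagged = find_ungrounded(extracted, page_text)
print(flagged)
```

A flagged field is not necessarily wrong, but it is a good candidate for a manual look.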

### Very intense structured data extraction

Instead of being kind of loose and free with what you want, you can also get MUCH fancier and write a Pydantic model. It will not only send the column names you want, but also little descriptions and demands about strings (text), integers, floats and more.

You can find more details [here](https://platform.openai.com/docs/guides/structured-outputs).
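
To see roughly what a Pydantic model contributes beyond column names, you can print its JSON schema. This is a minimal sketch assuming Pydantic v2 (where `model_json_schema()` is available); `MiniReport` is a made-up, trimmed-down model, and exactly how the schema is passed along to the LLM is not shown here.

```python
from pydantic import BaseModel, Field

class MiniReport(BaseModel):
    # A description gives the model a hint about what the field means
    site: str = Field(description="Name of company inspected")
    # A bare type annotation still tells it the value must be an integer
    violation_count: int

schema = MiniReport.model_json_schema()
print(schema["properties"])
```

The descriptions and types both show up in the schema, which is why they are worth writing carefully.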

In [None]:
import os
from pydantic import BaseModel, Field
from openai import OpenAI

# Initialize your LLM client
# Anything OpenAI-compatible works!
client = OpenAI(
    api_key=os.environ["GOOGLE_API_KEY"],
    base_url="https://generativelanguage.googleapis.com/v1beta/openai/"
)

# Define your schema
class ReportInfo(BaseModel):
    inspection_number: str = Field(description="The main report identifier")
    inspection_date: str = Field(description="The date of the inspection")
    inspection_service: str = Field(description="Name of inspection service")
    site: str = Field(description="Name of company inspected")
    summary: str = Field(description="Visit summary")
    city: str
    state: str = Field(description="Full name of state")
    violation_count: int

# Extract data
# page.extract(schema=ReportInfo, client=client, model="gemini-2.5-flash-lite") 
page.extract(schema=ReportInfo, client=client, model="gemini-2.5-flash") 

In [None]:
page.extracted() 

In [None]:
dict(page.extracted())

In [None]:
page.extracted('inspection_date')

## Table extraction with LLMs

In the example below, we're saying "Using Gemini, provide a violations table - each row should have a statute, a description, a level, and a repeat-checked value."

In [None]:
import os
from pydantic import BaseModel, Field
from openai import OpenAI
from typing import List, Literal

client = OpenAI(
    api_key=os.environ["GOOGLE_API_KEY"],
    base_url="https://generativelanguage.googleapis.com/v1beta/openai/"
)

class ViolationsRow(BaseModel):
    statute: str
    description: str
    level: str
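    # Pass the text as description=...; Field's first positional argument is the default value
    repeat_checked: Literal["checked", "unchecked"] = Field(description="Whether the checkbox is checked or not")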

class ViolationsTable(BaseModel):
    inspection_id: str
    violations: List[ViolationsRow]

page.extract(schema=ViolationsTable, client=client, model="gemini-2.5-flash") 

Note that when we look below... **it didn't do the checked/unchecked correctly!**

In [None]:
import pandas as pd

data = page.extracted()
pd.DataFrame(data.model_dump()['violations'])

## Figuring out how to manage those pesky checkboxes

In [None]:
from natural_pdf import PDF

pdf = PDF("https://github.com/jsoma/natural-pdf/raw/refs/heads/main/pdfs/01-practice.pdf")
page = pdf.pages[0]
page.show(width=500)

We can use `.extract_table()` no problem to get *most* of the columns, but we really really want those checkboxes!

In [None]:
import pandas as pd

df = page.extract_table().to_df()
df

Let's find all of the boxes below the "Violations" header...

In [None]:
boxes = (
    page
    .find(text='Violations')
    .below()
    .find_all('rect')
)

boxes.show(crop=True)

Let's go through each box: **do you have a line inside of you?**

In [None]:
rect1 = boxes[1]
rect1.show(crop=True)

In [None]:
rect1.find('line')

In [None]:
rect2 = boxes[4]
rect2.show(crop=True)

In [None]:
rect2.find('line')

We can use `.apply` to go through each box and say 'yes' if there's a line, and 'no' otherwise.

In [None]:
(
    page
    .find(text='Violations')
    .below()
    .find_all('rect')
    .apply(lambda box: 'yes' if box.find('line') else 'no')
)

In [None]:
df['repeat'] = (
    page
    .find(text='Violations')
    .below()
    .find_all('rect')
    .apply(lambda box: 'yes' if box.find('line') else 'no')
)
df.head()
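
Adding the column this way only lines up if there is exactly one box per table row. Here is a small sanity check, sketched with plain Python lists standing in for the table rows and the yes/no results; the `attach_repeat` helper and the sample values are hypothetical.

```python
def attach_repeat(rows: list, flags: list) -> list:
    """Add a 'repeat' value to each row, refusing to guess if the counts differ."""
    if len(rows) != len(flags):
        raise ValueError(f"{len(rows)} table rows but {len(flags)} checkboxes")
    return [{**row, "repeat": flag} for row, flag in zip(rows, flags)]

# Made-up rows and checkbox results, for illustration only
rows = [{"statute": "1.2.3"}, {"statute": "4.5.6"}]
merged = attach_repeat(rows, ["yes", "no"])
print(merged)
```

If the counts don't match, an extra rectangle (a table border, for instance) may have been picked up along with the checkboxes.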

## Classification

But what if it's an *image* of a rectangle that's checked or unchecked? No worries, AI to the rescue yet again! And this time it's a *local model*, something where you don't have to rely on ChatGPT or Anthropic or any of those.

In [None]:
rect1 = page.find_all('rect')[2].expand(-1)
rect1.show(crop=True)

We can use `.classify` and `.category` to see whether it's a square or an X. Or checked vs unchecked? ...or blank or an X?

In [None]:
rect2 = page.find_all('rect')[5].expand(-1)
rect2.show(crop=True)

In [None]:
boxes = (
    page
    .find(text='Violations')
    .below()
    .find_all('rect')
    .expand(-1)
)
boxes.show(crop=True)

In [None]:
(
    boxes
    .classify_all(['blank', 'X'], using="vision")
    .apply(lambda r: r.category)
)

In [None]:
df['repeat'] = (
    boxes
    .classify_all(['blank', 'X'], using="vision")
    .apply(lambda r: r.category)
)
df

# Putting things in categories

## Categorizing an entire PDF

In [None]:
from natural_pdf import PDF

pdf = PDF("https://github.com/jsoma/natural-pdf/raw/refs/heads/main/pdfs/01-practice.pdf")
page = pdf.pages[0]
page.show(width=500)

What can we classify the entire PDF as? Maybe a... slaughterhouse report? A dolphin training manual? Something about basketball or birding?

In [None]:
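pdf.classify(['slaughterhouse report', 'dolphin training manual', 'basketball', 'birding'], using='text')
(pdf.category, pdf.category_confidence)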

## Classifying pages of a PDF

Let's take a look at a document from the CIA investigating whether you can **use pigeons as spies**.

In [None]:
from natural_pdf import PDF

pdf = PDF("https://github.com/jsoma/ire25-natural-pdf/raw/refs/heads/main/cia-doc.pdf")
pdf.pages.to_image(cols=6)

Just like we did above, we can ask what category we think the PDF belongs to.

In [None]:
pdf.classify(['slaughterhouse report', 'dolphin training manual', 'basketball', 'birding'], using='text')
(pdf.category, pdf.category_confidence)

But notice how all of the pages look very very different: **we can also categorize each page using vision**.

In [None]:
pdf.classify_pages(['diagram', 'text', 'invoice', 'blank'], using='vision')

for page in pdf.pages:
    print(f"Page {page.number} is {page.category} - {page.category_confidence:0.3}")
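
Those confidence scores are handy for triage. As a sketch, here is one way to pull out the pages worth double-checking by hand, using made-up `(page number, category, confidence)` tuples in place of real page objects; the `needs_review` helper and the 0.6 cutoff are arbitrary choices for illustration.

```python
def needs_review(pages, threshold=0.6):
    """Return the page numbers whose classification confidence is below the threshold."""
    return [number for number, category, confidence in pages if confidence < threshold]

# Made-up classification results, for illustration only
pages = [(1, "text", 0.91), (2, "diagram", 0.55), (3, "blank", 0.72)]
print(needs_review(pages))
```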

And if we just want to see the pages that are diagrams, we can `.filter` for them.

In [None]:
(
    pdf.pages
    .filter(lambda page: page.category == 'diagram')
    .to_image(show_category=True)
)


And if that's all we're interested in? We can save a new PDF of just those pages!

In [None]:
(
    pdf.pages
    .filter(lambda page: page.category == 'diagram')
    .save_pdf("diagrams.pdf", original=True)
)